Interfacing Sound Stream Segregation to Automatic Speech Recognition - Preliminary Results on Listening to Several Sounds Simultaneously

Authors

  • Hiroshi G. Okuno
  • Tomohiro Nakatani
  • Takeshi Kawabata
Abstract

This paper reports preliminary results of experiments on listening to several sounds at once. Two issues are addressed: segregating speech streams from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition (ASR). Speech stream segregation (SSS) is modeled as a process of extracting harmonic fragments, grouping the extracted fragments, and substituting sounds for the non-harmonic parts of each group. The system is implemented by extending the harmonic-based stream segregation system reported at AAAI-94 and IJCAI-95. The main problem in interfacing SSS with HMM-based ASR is how to improve recognition performance, which is degraded by the spectral distortion of segregated sounds caused mainly by binaural input, grouping, and residue substitution. Our solution is to re-train the HMM parameters on training data binauralized for four directions, to group harmonic fragments according to their directions, and to substitute the residue of the harmonic fragments for the non-harmonic parts of each group. In experiments with 500 mixtures of two women's utterances of a word, the cumulative accuracy of word recognition up to the 10th candidate was, on average, 75% for each woman's utterance.

Introduction

Usually, people hear a mixture of sounds, and people with normal hearing can segregate individual sounds from the mixture and focus on a particular voice or sound in a noisy environment. This capability is known as the cocktail party effect (Cherry 1953). Perceptual segregation of sounds, called auditory scene analysis, has been studied by psychoacoustic researchers for more than forty years. Although many observations have been analyzed and reported (Bregman 1990), it is only recently that researchers have begun to model auditory scene analysis computationally (Cooke et al. 1993; Green et al. 1995; Nakatani et al. 1994a).
This emerging research area is called computational auditory scene analysis (CASA), and a workshop on CASA was held at IJCAI-95 (Rosenthal & Okuno 1996). One application of CASA is as a front-end system for automatic speech recognition (ASR). Hearing-impaired people find it difficult to listen to sounds in a noisy environment, and sound segregation is expected to improve the performance of hearing aids by reducing background noise, echoes, and the sounds of competing talkers. Similarly, most current ASR systems do not work well in the presence of competing voices or interfering noises, and CASA may provide a robust front end for them.

CASA is not simply a hearing aid for ASR systems, though. Computer audition can listen to several things at once by segregating the sounds. This capability has been called the Prince Shotoku effect by Okuno (Okuno et al. 1995), after Prince Shotoku (574–622 A.D.), who is said to have been able to listen to ten people's petitions at the same time. Since this is virtually impossible for humans, CASA research would make computer audition more powerful than human audition, much as an airplane's flying ability exceeds that of a bird.

At present, one of the hottest topics in ASR research is how to make ASR systems more robust so that they perform well outside laboratory conditions (Hansen et al. 1994). The usual approaches are to reduce noise, use speaker adaptation, and treat sounds other than human voices as noise. CASA takes the opposite approach: it first tackles the problems of handling general sounds in order to develop methods and technologies, and then applies these to build ASR systems that work in real-world environments. In this paper, we discuss the issues in interfacing sound segregation systems with ASR systems and report preliminary results on ASR for a mixture of sounds.
Sound Stream Segregation

Sound segregation should be incremental, because CASA is used as a front end for ASR systems and other applications that should run in real time. Many representations of sound have been proposed, for example auditory maps (Brown 1992) and synchrony strands (Cooke et al. 1993), but most of them are unsuitable for incremental processing. Nakatani and Okuno proposed using a sound stream (or simply stream) to represent a sound (Nakatani et al. 1994a). A sound stream is a group of sound components that have consistent attributes. With sound streams, the Prince Shotoku effect can be modeled as shown in Fig. 1: sound streams are segregated by the sound segregation system, and then speech streams are selected and passed on to the ASR systems. Sound stream segregation consists of two subprocesses:

1. Stream fragment extraction — a fragment of a stream that has consistent attributes is extracted from a mixture of sounds.
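The harmonic-fragment idea behind stream fragment extraction can be illustrated with a toy sketch: given one analysis frame's magnitude spectrum and an estimated fundamental, the bins near integer multiples of the fundamental form the harmonic fragment, and everything left over is the residue (which the abstract substitutes for the non-harmonic parts of a group). This is a simplified, single-frame illustration, not the authors' implementation, which works on binaural input and tracks fundamentals over time; the names `split_harmonic`, `f0_bin`, and `width` are hypothetical.

```python
def split_harmonic(spectrum, f0_bin, width=1):
    """Partition a magnitude spectrum into a harmonic fragment and a residue.

    spectrum : list of magnitudes per frequency bin (one analysis frame)
    f0_bin   : bin index of the estimated fundamental frequency
    width    : number of neighbouring bins around each harmonic to keep
    """
    harmonic = [0.0] * len(spectrum)
    residue = list(spectrum)
    k = f0_bin
    while k < len(spectrum):
        # claim the bins around the k-th harmonic for the fragment
        for b in range(max(0, k - width), min(len(spectrum), k + width + 1)):
            harmonic[b] = spectrum[b]
            residue[b] = 0.0
        k += f0_bin
    return harmonic, residue
```

Every bin ends up in exactly one of the two parts, so fragment plus residue reconstructs the original frame; repeating this per frame and linking fragments with consistent fundamentals over time yields the stream fragments described above.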




Publication date: 1996